Data Manipulation with tidyverse

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

dplyr is a core tidyverse package that provides a consistent and intuitive grammar for data manipulation.

1 Load Packages and Data

First, we load the tidyverse suite of packages.

Code
library(tidyverse)

We will use the built-in mtcars dataset for these examples. For simplicity, we select a few columns and convert the row names to a proper column called car_name.

Code
data(mtcars)
small_mtcars <- mtcars %>% 
  select(cyl, mpg, hp) %>% 
  head()

small_mtcars <- rownames_to_column(small_mtcars, var = "car_name")

small_mtcars
           car_name cyl  mpg  hp
1         Mazda RX4   6 21.0 110
2     Mazda RX4 Wag   6 21.0 110
3        Datsun 710   4 22.8  93
4    Hornet 4 Drive   6 21.4 110
5 Hornet Sportabout   8 18.7 175
6           Valiant   6 18.1 105

2 Data Manipulation with dplyr

2.1 Select Columns

2.1.1 Get Column Names

The names() function returns a character vector of the column names in a data frame.

Code
names(small_mtcars)
[1] "car_name" "cyl"      "mpg"      "hp"      

2.1.2 Select by Name

The select() verb chooses columns by name.

Code
small_mtcars %>% select(cyl, mpg, hp) 
  cyl  mpg  hp
1   6 21.0 110
2   6 21.0 110
3   4 22.8  93
4   6 21.4 110
5   8 18.7 175
6   6 18.1 105

2.1.3 Select with Helpers

select() can be used with helper functions like contains() to choose columns based on patterns in their names.

Code
small_mtcars %>% select(contains("p")) 
   mpg  hp
1 21.0 110
2 21.0 110
3 22.8  93
4 21.4 110
5 18.7 175
6 18.1 105

2.1.4 Select by Index

You can also select columns by their position.

Code
small_mtcars %>% select(1, 2) 
           car_name cyl
1         Mazda RX4   6
2     Mazda RX4 Wag   6
3        Datsun 710   4
4    Hornet 4 Drive   6
5 Hornet Sportabout   8
6           Valiant   6

2.1.5 Drop Columns

Use the - sign to deselect or drop columns.

Code
small_mtcars %>% select(-cyl)
           car_name  mpg  hp
1         Mazda RX4 21.0 110
2     Mazda RX4 Wag 21.0 110
3        Datsun 710 22.8  93
4    Hornet 4 Drive 21.4 110
5 Hornet Sportabout 18.7 175
6           Valiant 18.1 105

2.2 Rename Columns

The rename() verb changes the name of a column.

Code
small_mtcars %>% rename(new_cyl = cyl)
           car_name new_cyl  mpg  hp
1         Mazda RX4       6 21.0 110
2     Mazda RX4 Wag       6 21.0 110
3        Datsun 710       4 22.8  93
4    Hornet 4 Drive       6 21.4 110
5 Hornet Sportabout       8 18.7 175
6           Valiant       6 18.1 105

2.3 Create Columns

2.3.1 Mutate

mutate() adds new columns or transforms existing ones.

Code
small_mtcars %>% mutate(new_cyl = cyl + 1)
           car_name cyl  mpg  hp new_cyl
1         Mazda RX4   6 21.0 110       7
2     Mazda RX4 Wag   6 21.0 110       7
3        Datsun 710   4 22.8  93       5
4    Hornet 4 Drive   6 21.4 110       7
5 Hornet Sportabout   8 18.7 175       9
6           Valiant   6 18.1 105       7

2.3.2 Conditional Creation with if_else

if_else() is a vectorized conditional statement, useful inside mutate().

Code
small_mtcars %>% mutate(new_cly_group = if_else(cyl > 6, 'big', 'small'))
           car_name cyl  mpg  hp new_cly_group
1         Mazda RX4   6 21.0 110         small
2     Mazda RX4 Wag   6 21.0 110         small
3        Datsun 710   4 22.8  93         small
4    Hornet 4 Drive   6 21.4 110         small
5 Hornet Sportabout   8 18.7 175           big
6           Valiant   6 18.1 105         small

2.3.3 Conditional Creation with case_when

case_when() is useful for multiple conditions (like a multi-part if-else statement).

Code
small_mtcars %>% mutate(cly_group = case_when(
    cyl > 6 ~ "very big",
    cyl > 4 ~ "big",
    TRUE ~ "other"
  ))
           car_name cyl  mpg  hp cly_group
1         Mazda RX4   6 21.0 110       big
2     Mazda RX4 Wag   6 21.0 110       big
3        Datsun 710   4 22.8  93     other
4    Hornet 4 Drive   6 21.4 110       big
5 Hornet Sportabout   8 18.7 175  very big
6           Valiant   6 18.1 105       big

2.3.4 Transmute

transmute() is like mutate(), but it drops all other columns, keeping only the newly created ones.

Code
small_mtcars %>% transmute(new_cyl = cyl + 1)
  new_cyl
1       7
2       7
3       5
4       7
5       9
6       7

2.4 Filter Rows

filter() subsets rows based on logical conditions.

Code
small_mtcars %>% filter(cyl > 5)
           car_name cyl  mpg  hp
1         Mazda RX4   6 21.0 110
2     Mazda RX4 Wag   6 21.0 110
3    Hornet 4 Drive   6 21.4 110
4 Hornet Sportabout   8 18.7 175
5           Valiant   6 18.1 105

2.4.1 Filters with AND conditions

Separate conditions with a comma for an AND relationship.

Code
small_mtcars %>% filter(cyl > 5, mpg > 20)
        car_name cyl  mpg  hp
1      Mazda RX4   6 21.0 110
2  Mazda RX4 Wag   6 21.0 110
3 Hornet 4 Drive   6 21.4 110

2.4.2 Filters with OR conditions

Use the | operator for an OR relationship.

Code
small_mtcars %>% filter(cyl > 5 | mpg > 20)
           car_name cyl  mpg  hp
1         Mazda RX4   6 21.0 110
2     Mazda RX4 Wag   6 21.0 110
3        Datsun 710   4 22.8  93
4    Hornet 4 Drive   6 21.4 110
5 Hornet Sportabout   8 18.7 175
6           Valiant   6 18.1 105

2.4.3 Filter Rows by Index with slice

slice() allows you to select rows by their position.

Select the 5th row:

Code
small_mtcars %>% slice(5)
           car_name cyl  mpg  hp
1 Hornet Sportabout   8 18.7 175

Select a random sample of rows:

Code
small_mtcars %>% sample_n(3)
           car_name cyl  mpg  hp
1 Hornet Sportabout   8 18.7 175
2         Mazda RX4   6 21.0 110
3        Datsun 710   4 22.8  93

2.5 Handle Missing Data

Let’s create a sample data frame with missing values (NA).

Code
missing_df <- data.frame(
  x = c(1, 2, NA, 4),
  y = c("a", NA, "c", "d")
)

2.5.1 Find Missing Data

is.na() returns a logical vector indicating which values are missing.

Code
is.na(missing_df)
         x     y
[1,] FALSE FALSE
[2,] FALSE  TRUE
[3,]  TRUE FALSE
[4,] FALSE FALSE

2.5.2 Filter Out Missing Data

You can use filter() with !is.na() to remove rows with missing values in a specific column.

Code
missing_df %>% filter(!is.na(y))
   x y
1  1 a
2 NA c
3  4 d

tidyr::drop_na() removes rows with any missing values.

Code
missing_df %>% drop_na()
  x y
1 1 a
2 4 d

2.6 Group and Summarize Data

group_by() groups the data by one or more variables. summarise() then collapses each group into a single-row summary.

2.6.1 Common Summary Functions

Calculate the mean, min, max, and sum of mpg for each cylinder group.

Code
mtcars %>% 
  group_by(cyl) %>% 
  summarise(
    avg_mpg = mean(mpg),
    min_mpg = min(mpg),
    max_mpg = max(mpg),
    sum_mpg = sum(mpg)
  )
# A tibble: 3 × 5
    cyl avg_mpg min_mpg max_mpg sum_mpg
  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1     4    26.7    21.4    33.9    293.
2     6    19.7    17.8    21.4    138.
3     8    15.1    10.4    19.2    211.

2.6.2 Count Records

n() counts the total number of records in each group, while n_distinct() counts the number of unique values.

Code
mtcars %>% 
  group_by(cyl) %>% 
  summarise(
    n_mpg = n(),
    distinct_n_mpg = n_distinct(mpg)
  )
# A tibble: 3 × 3
    cyl n_mpg distinct_n_mpg
  <dbl> <int>          <int>
1     4    11              9
2     6     7              6
3     8    14             12

2.7 Window Functions

Window functions are functions that operate on a “window” of data (e.g., within a group) but, unlike summarise(), they return a value for every row.

2.7.1 Ranking with row_number

row_number() assigns a unique rank to each row within a group.

Code
small_mtcars %>% 
  group_by(cyl) %>% 
  mutate(rank = row_number(desc(mpg))) %>% 
  select(cyl, mpg, rank)
# A tibble: 6 × 3
# Groups:   cyl [3]
    cyl   mpg  rank
  <dbl> <dbl> <int>
1     6  21       2
2     6  21       3
3     4  22.8     1
4     6  21.4     1
5     8  18.7     1
6     6  18.1     4

2.7.2 Accessing Previous/Next Values with lag and lead

lag() and lead() are useful for comparing a value to its predecessor or successor.

Code
small_mtcars %>% 
  select(cyl, mpg) %>% 
  mutate(mpg_previous = lag(mpg, n = 1))
  cyl  mpg mpg_previous
1   6 21.0           NA
2   6 21.0         21.0
3   4 22.8         21.0
4   6 21.4         22.8
5   8 18.7         21.4
6   6 18.1         18.7

2.7.3 Cumulative Summaries

cumsum() calculates the cumulative sum.

Code
small_mtcars %>% 
  select(cyl, mpg) %>% 
  mutate(mpg_running_total = cumsum(mpg))
  cyl  mpg mpg_running_total
1   6 21.0              21.0
2   6 21.0              42.0
3   4 22.8              64.8
4   6 21.4              86.2
5   8 18.7             104.9
6   6 18.1             123.0

2.8 Order Rows

arrange() sorts the rows of a data frame by one or more columns.

Code
small_mtcars %>% arrange(cyl) 
           car_name cyl  mpg  hp
1        Datsun 710   4 22.8  93
2         Mazda RX4   6 21.0 110
3     Mazda RX4 Wag   6 21.0 110
4    Hornet 4 Drive   6 21.4 110
5           Valiant   6 18.1 105
6 Hornet Sportabout   8 18.7 175

Sort in descending order using desc().

Code
small_mtcars %>% arrange(desc(cyl))
           car_name cyl  mpg  hp
1 Hornet Sportabout   8 18.7 175
2         Mazda RX4   6 21.0 110
3     Mazda RX4 Wag   6 21.0 110
4    Hornet 4 Drive   6 21.4 110
5           Valiant   6 18.1 105
6        Datsun 710   4 22.8  93

2.9 Join Tables

First, let’s create two small data frames to demonstrate joins.

Code
left_data = small_mtcars %>% slice(1:3)
right_data = small_mtcars %>% slice(2:4)

2.9.1 Inner Join

inner_join() returns only the rows where the key exists in both tables.

Code
left_data %>% inner_join(right_data, by = "car_name")
       car_name cyl.x mpg.x hp.x cyl.y mpg.y hp.y
1 Mazda RX4 Wag     6  21.0  110     6  21.0  110
2    Datsun 710     4  22.8   93     4  22.8   93

2.9.2 Left Join

left_join() returns all rows from the left table, and matching rows from the right table.

Code
left_data %>% left_join(right_data, by = "car_name", suffix = c("_l", "_r"))
       car_name cyl_l mpg_l hp_l cyl_r mpg_r hp_r
1     Mazda RX4     6  21.0  110    NA    NA   NA
2 Mazda RX4 Wag     6  21.0  110     6  21.0  110
3    Datsun 710     4  22.8   93     4  22.8   93

2.9.3 Anti Join

anti_join() returns all rows from the left table that do not have a match in the right table.

Code
left_data %>% anti_join(right_data, by = "car_name")
   car_name cyl mpg  hp
1 Mazda RX4   6  21 110

2.10 Reshape Tables

Reshaping data involves converting it between “wide” and “long” formats.

Code
olddata_wide <- read.table(header = TRUE, text = '
 subject sex control cond1 cond2
       1   M     7.9  12.3  10.7
       2   F     6.3  10.6  11.1
       3   F     9.5  13.1  13.8
       4   M    11.5  13.4  12.9
')

2.10.1 Wide to Long with pivot_longer

pivot_longer() makes data “longer” by gathering multiple columns into key-value pairs.

Code
data_long = olddata_wide %>% 
  pivot_longer(!c(subject, sex), names_to = 'condition', values_to = 'measurement')

data_long
# A tibble: 12 × 4
   subject sex   condition measurement
     <int> <chr> <chr>           <dbl>
 1       1 M     control           7.9
 2       1 M     cond1            12.3
 3       1 M     cond2            10.7
 4       2 F     control           6.3
 5       2 F     cond1            10.6
 6       2 F     cond2            11.1
 7       3 F     control           9.5
 8       3 F     cond1            13.1
 9       3 F     cond2            13.8
10       4 M     control          11.5
11       4 M     cond1            13.4
12       4 M     cond2            12.9

2.10.2 Long to Wide with pivot_wider

pivot_wider() does the opposite, making data “wider” by spreading a key-value pair into multiple columns.

Code
data_wide = data_long %>% 
  pivot_wider(names_from = condition, values_from = measurement)

data_wide
# A tibble: 4 × 5
  subject sex   control cond1 cond2
    <int> <chr>   <dbl> <dbl> <dbl>
1       1 M         7.9  12.3  10.7
2       2 F         6.3  10.6  11.1
3       3 F         9.5  13.1  13.8
4       4 M        11.5  13.4  12.9

3 String Manipulation with stringr

stringr provides a consistent and user-friendly interface for common string operations.

3.1 String Operations

Code
x <- "I like horses."
str_length(x)
[1] 14
Code
str_to_upper(x)
[1] "I LIKE HORSES."

3.2 Pattern Matching

str_detect() checks for the presence of a pattern and returns a logical vector.

Code
word_list = c('abc', 'bbc', 'apple', 'bbaa', 'cc')
word_list %>% str_detect('a')
[1]  TRUE FALSE  TRUE  TRUE FALSE

str_replace_all() replaces all matches of a pattern with a new string.

Code
text001 = "abcb"
text001 %>% str_replace_all('b', '1')
[1] "a1c1"

3.3 Splitting Strings

tidyr::separate() splits a column into multiple new columns based on a delimiter.

Code
df_word <- data.frame(word = c('a-b', '1-c', 'c-c'))
df_word %>% separate(word, c('col1', 'col2'), '-')
  col1 col2
1    a    b
2    1    c
3    c    c

3.4 Extracting Substrings

str_extract() pulls out the first match of a regular expression.

Code
trx = 'abc1993 ccc'
trx %>% str_extract("(\\d)+")
[1] "1993"

4 Date and Time Manipulation with lubridate

lubridate simplifies working with dates and times in R.

Code
library(lubridate)

4.1 Parsing Dates

lubridate provides helper functions like ymd() to parse strings into dates reliably.

Code
date3 = ymd('2023-01-01')
class(date3)
[1] "Date"

4.2 Date Components

Easily extract components like year, month, or day from a date object.

Code
now_time = now()
year(now_time)
[1] 2025
Code
month(now_time)
[1] 7
Code
wday(now_time, label = TRUE)
[1] Tue
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

4.3 Date Arithmetic

Perform calculations with date objects.

Code
day1 = ymd('2022-01-01')
day2 = ymd('2023-02-03')

# Calculate the time difference
interval(day1, day2) %/% months(1)
[1] 13

5 dataframe to other data format

5.1 dataframe to vector

Code
data=small_mtcars$cyl
data
[1] 6 6 4 6 8 6
Code
class(data)
[1] "numeric"

5.2 dataframe to matrix

Code
data=data.matrix(small_mtcars)
data
     car_name cyl  mpg  hp
[1,]        4   6 21.0 110
[2,]        5   6 21.0 110
[3,]        1   4 22.8  93
[4,]        2   6 21.4 110
[5,]        3   8 18.7 175
[6,]        6   6 18.1 105
Code
class(data)
[1] "matrix" "array" 

5.3 dataframe to list

Code
data=as.list(small_mtcars)
data
$car_name
[1] "Mazda RX4"         "Mazda RX4 Wag"     "Datsun 710"       
[4] "Hornet 4 Drive"    "Hornet Sportabout" "Valiant"          

$cyl
[1] 6 6 4 6 8 6

$mpg
[1] 21.0 21.0 22.8 21.4 18.7 18.1

$hp
[1] 110 110  93 110 175 105
Code
class(data)
[1] "list"

6 Reference:

https://dplyr.tidyverse.org/

https://stringr.tidyverse.org/

https://lubridate.tidyverse.org/

Back to top